1. Identity statement | |
Reference Type | Conference Paper (Conference Proceedings) |
Site | sibgrapi.sid.inpe.br |
Holder Code | ibi 8JMKD3MGPEW34M/46T9EHH |
Identifier | 8JMKD3MGPEW34M/45BRTJ8 |
Repository | sid.inpe.br/sibgrapi/2021/08.31.02.34 |
Last Update | 2021:08.31.02.34.11 (UTC) administrator |
Metadata Repository | sid.inpe.br/sibgrapi/2021/08.31.02.34.11 |
Metadata Last Update | 2022:06.14.00.00.17 (UTC) administrator |
DOI | 10.1109/SIBGRAPI54419.2021.00048 |
Citation Key | Correa:2021:CoOpCh |
Title | Combination of Optical Character Recognition Engines for Documents Containing Sparse Text and Alphanumeric Codes |
Format | On-line |
Year | 2021 |
Access Date | 2024, May 06 |
Number of Files | 1 |
Size | 222 KiB |
|
2. Context | |
Author | Correa, Iago Lourenço |
Affiliation | Federal University of Rio Grande (FURG) |
Editor | Paiva, Afonso; Menotti, David; Baranoski, Gladimir V. G.; Proença, Hugo Pedro; Junior, Antonio Lopes Apolinario; Papa, João Paulo; Pagliosa, Paulo; dos Santos, Thiago Oliveira; e Sá, Asla Medeiros; da Silveira, Thiago Lopes Trugillo; Brazil, Emilio Vital; Ponti, Moacir A.; Fernandes, Leandro A. F.; Avila, Sandra |
e-Mail Address | iago.correa@outlook.com |
Conference Name | Conference on Graphics, Patterns and Images, 34 (SIBGRAPI) |
Conference Location | Gramado, RS, Brazil (virtual) |
Date | 18-22 Oct. 2021 |
Publisher | IEEE Computer Society |
Publisher City | Los Alamitos |
Book Title | Proceedings |
Tertiary Type | Full Paper |
History (UTC) | 2021-08-31 02:34:11 :: iago.correa@outlook.com -> administrator :: ; 2022-03-02 00:54:15 :: administrator -> menottid@gmail.com :: 2021; 2022-03-02 13:23:54 :: menottid@gmail.com -> administrator :: 2021; 2022-06-14 00:00:17 :: administrator -> :: 2021 |
|
3. Content and structure | |
Is the master or a copy? | is the master |
Content Stage | completed |
Transferable | 1 |
Version Type | finaldraft |
Keywords | optical character recognition, classifier combination, pattern recognition, Tesseract, median string |
Abstract | Many companies that buy machines, parts, or tools retain documents such as notes, receipts, forms, or instruction manuals over the years, and may eventually need to digitize them. When optical character recognition (OCR) systems are applied to these documents, two main difficulties arise. The first is locating sparse text that is laid out non-continuously, and the second is recognizing strings that are closer to alphanumeric codes than to words in human language. Although the literature contains many works on sparse text, such as forms and tables, there is usually little attention to codes, for which one cannot rely on dictionaries, or to both problems together. Therefore, to address these issues without having to search for extensive databases or to train and develop new models, this work proposes taking advantage of pre-trained OCR models, such as those of the Tesseract engine or Google Cloud's Vision API. To do so, we explore combination strategies, including a new one based on the median string. The experimental results show improvements of up to 3.09% in character accuracy and 1.16% in word accuracy compared with the best individual performances of the engines when our method based on string combination is adopted. |
Arrangement 1 | urlib.net > SDLA > Fonds > SIBGRAPI 2021 > Combination of Optical... |
Arrangement 2 | urlib.net > SDLA > Fonds > Full Index > Combination of Optical... |
doc Directory Content | access |
source Directory Content | there are no files |
agreement Directory Content | |
|
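The abstract mentions a combination strategy based on the median string but this record does not describe it. As a rough illustration only, the sketch below computes a set median: among the candidate readings, it picks the one with the smallest summed Levenshtein distance to all of them. The function names and the sample readings are hypothetical, and this is not necessarily the method used in the paper.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def set_median_string(candidates: Sequence[str]) -> str:
    """Return the candidate with the smallest summed distance to all candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates,
               key=lambda s: sum(levenshtein(s, t) for t in candidates))


# Hypothetical readings of one alphanumeric code from three OCR engines.
readings = ["AB-1234", "A8-1234", "AB-1234"]
print(set_median_string(readings))
```

A set median can only return one of the inputs, so it cannot repair an error shared by every engine; a generalized median string, which searches over all possible strings, would lift that restriction at a higher computational cost.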
4. Conditions of access and use | |
data URL | http://urlib.net/ibi/8JMKD3MGPEW34M/45BRTJ8 |
zipped data URL | http://urlib.net/zip/8JMKD3MGPEW34M/45BRTJ8 |
Language | en |
Target File | Paper ID 28.pdf |
User Group | iago.correa@outlook.com |
Visibility | shown |
Update Permission | not transferred |
|
5. Allied materials | |
Mirror Repository | sid.inpe.br/banon/2001/03.30.15.38.24 |
Next Higher Units | 8JMKD3MGPEW34M/45PQ3RS 8JMKD3MGPEW34M/4742MCS |
Citing Item List | sid.inpe.br/sibgrapi/2021/11.12.11.46 3 |
Host Collection | sid.inpe.br/banon/2001/03.30.15.38 |
|
6. Notes | |
Empty Fields | archivingpolicy archivist area callnumber contenttype copyholder copyright creatorhistory descriptionlevel dissemination edition electronicmailaddress group isbn issn label lineage mark nextedition notes numberofvolumes orcid organization pages parameterlist parentrepositories previousedition previouslowerunit progress project readergroup readpermission resumeid rightsholder schedulinginformation secondarydate secondarykey secondarymark secondarytype serieseditor session shorttitle sponsor subject tertiarymark type url volume |
|